Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.12 - Performing Joins (clean one)")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext
pets = spark.createDataFrame(
    [
        (1, 1, 'Bear'),
        (2, 1, 'Chewie'),
        (3, 2, 'Roger'),
    ], ['id', 'breed_id', 'nickname']
)

pets.toPandas()
   id  breed_id nickname
0   1         1     Bear
1   2         1   Chewie
2   3         2    Roger
breeds = spark.createDataFrame(
    [
        (1, 'Pitbull', 10), 
        (2, 'Corgie', 20), 
    ], ['id', 'name', 'average_height']
)

breeds.toPandas()
   id     name  average_height
0   1  Pitbull              10
1   2   Corgie              20

Performing Joins

There are typically two types of joins in SQL:

  1. Inner Join, where 2 tables are joined on the basis of the common columns mentioned in the ON clause.

    ie. left.join(right, left[lkey] == right[rkey])

  2. Natural Join, where 2 tables are joined on the basis of all of their common columns.

    ie. left.join(right, 'key')

source: https://stackoverflow.com/a/8696402

Question: Which is better? Is it just a style choice?

Option 1: Inner Join (w/Different Keys)

join_condition = pets['breed_id'] == breeds['id']

df = pets.join(breeds, join_condition)

df.toPandas()
   id  breed_id nickname  id     name  average_height
0   1         1     Bear   1  Pitbull              10
1   2         1   Chewie   1  Pitbull              10
2   3         2    Roger   2   Corgie              20

What Happened:

  • We have 2 columns named id, but they refer to different things.
  • We can't easily reference either id column unambiguously (it's still possible, just clunky).
  • Pretty long join expression.

This is not ideal. Let's try renaming the column before the join.

Option 2: Inner Join (w/Same Keys)

breeds = breeds.withColumnRenamed('id', 'breed_id')
join_condition = pets['breed_id'] == breeds['breed_id']

df = pets.join(breeds, join_condition)

df.toPandas()
   id  breed_id nickname  breed_id     name  average_height
0   1         1     Bear         1  Pitbull              10
1   2         1   Chewie         1  Pitbull              10
2   3         2    Roger         2   Corgie              20

What Happened:

  • We have 2 columns named breed_id which mean the same thing!
  • Duplicate columns appear in the result.
  • Still pretty long join expression.

This is again not ideal.

Option 3: Natural Join

df = pets.join(breeds, 'breed_id')
df.toPandas()
   breed_id  id nickname     name  average_height
0         1   1     Bear  Pitbull              10
1         1   2   Chewie  Pitbull              10
2         2   3    Roger   Corgie              20

What Happened:

  • No duplicated column!
  • No extra column!
  • A single string is all the join expression needs (or a list of column names, when joining on multiple keys).

Summary

Performing a natural join was the most elegant solution, both for the join expression and for the resulting df.

NOTE: These rules also apply to the other join types (ie. left and right).

Some might argue that you need both join keys in the result for further transformations, such as filtering on only the left or right key. I would recommend doing those filters before the join instead, as that is more performant.
